TODO: Will Provide GitHub Link to Download QMD and HTML. HOW TO INSTALL PACKAGES?
Many of the most popular technologies for ST can be divided into two groups:
Spot-based: Next generation sequencing (NGS) with spatial barcoding
Molecule-based: In-situ imaging of individual molecules
FOOTNOTE REF
10X Genomics Visium Youtube Video
Colored probes attach to mRNA transcript from a target gene.
Quantify expression by imaging
Key challenge is extending to many genes
MERFISH uses (error robust) combinatorial labeling to increase the number of gene transcripts that can be measured.
Basic idea: assign a \(N\)-bit binary string to each gene. Then \(2^N\) genes can be encoded and measured after \(N\) sequential rounds of smFISH.
Source: Fig 1A of Chen et al. (2015). Science.
| Visium (Spot-based) | Molecule-based | |
|---|---|---|
| Spatial Resolution | 1-10 cells per spot | sub-cellular |
| Number of genes | Whole transcriptome (20,000+) | 100-10,000 (MERFISH) |
| Detection efficiency | ~15% | ~95% |
| Accessibility | Commercially available; raw data FASTQ | Raw images can be large and difficult to analyze |
Moses and Pachter (2022). Nature Methods.
Moses and Pachter (2022). Nature Methods.
Moses and Pachter (2022). Nature Methods.
Once processed, ST data can be represented using two matrices.
Count matrix \(Y\): For each of \(J\) spots record the number of reads from each of \(I\) genes.
Coordinate matrix \(X\): For each of \(J\) spots record the 2-dimensional spatial coordinate.
FIG
The SpatialExperiment package provides a container to store the count and coordinate matrix. ADD REF
Install the package:
Load the package:
The 10X genomics website has example ST datasets.
For this workshop, we will download samples from human cerebral cortex:
The data can be downloaded by navigating to the website or directly from the terminal (Unix):
mkdir cortex
cd cortex
curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz
curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_spatial.tar.gz
tar xf V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz
tar xf V1_Human_Brain_Section_1_spatial.tar.gzdir <- "./cortex"
### Step 1: Load the count matrix
require(DropletUtils)
fnm <- file.path(dir, "raw_feature_bc_matrix")
sce <- DropletUtils::read10xCounts(fnm)
### Step 2: Load the tissue image
img <- readImgData(
path = file.path(dir, "spatial"),
sample_id = "brain")
# Step 3: Read spatial coordinates
fnm <- file.path(dir, "spatial", "tissue_positions_list.csv")
xy <- read.csv(fnm, header = FALSE,
col.names = c(
"barcode", "in_tissue", "array_row", "array_col",
"pxl_row_in_fullres", "pxl_col_in_fullres"))
ix <- match(sce$Barcode, xy$barcode) #Get cells in correct order
xy <- xy[ix,]
### Step 4: Construct feature metadata
require(S4Vectors)
rd <- S4Vectors::DataFrame(
symbol = rowData(sce)$Symbol)
### Step 5: Create the object
spe <- SpatialExperiment(
assays = list(counts = assay(sce)),
rowData = rd,
colData = DataFrame(xy),
spatialCoordsNames = c("pxl_col_in_fullres", "pxl_row_in_fullres"),
imgData = img,
sample_id = "brain")
### Step 6: Save the object
saveRDS(spe, file=file.path(dir,"spe.RDS"))require(ggplot2)
df <- data.frame(x=spatialCoords(spe)[,1], y=spatialCoords(spe)[,2],
in_tissue=as.character(spe$in_tissue))
p <- ggplot(df,aes(x=x,y=y,color=in_tissue))+
geom_point(size=0.25)+
scale_color_manual(values=c("grey", "red"))+
theme_bw()
p
spe <- spe[,spe$in_tissue == 1] #This is how you subset by spots/samplesFor additional examples of data in the SpatialExperiment format, a good package is STexampleData :
A spatially variable gene is one whose expression depends on spatial location
In the human cerebral cortex, the genes MOBP, NPY, SNAP25, and PCP4 were previously reported by REF to be spatially variable. We will verify this by plotting.
Problem: Instead of gene symbols we are given ENSEMBL ID:
### Convert ENSEMBL ID to gene symbol
require(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- rownames(spe)
G_list <- getBM(filters="ensembl_gene_id",
attributes= c("ensembl_gene_id","hgnc_symbol"),
values=genes,
mart=mart)
no.match <- which(G_list$hgnc_symbol == "")
G_list[no.match,2] <- G_list[no.match,1]
ix <- match(G_list$ensembl_gene_id, rownames(spe))
rownames(spe)[ix] <- G_list$hgnc_symbol
interesting_genes <- which(rownames(spe) %in% c("MOBP","PCP4", "SNAP25","NPY"))### Normalize count matrix and subset to interesting genes
require(scran)
spe <- logNormCounts(spe)
Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])
## Switch rows and columns and make dataframe
df <- as.data.frame(t(Y.sub))
## Add spatial coordinates
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]### Construct plot
require(reshape2)
df <- melt(df, id.vars=c("x","y"))
p <- ggplot(data=df,aes(x=x,y=y,color=value))
p <- p + geom_point(size=0.05)
p <- p + coord_fixed()
p <- p + scale_color_gradientn(colors=c("gray90","blue","black"),
breaks=c(0,3,6))
p <- p + facet_wrap(~variable, nrow=2)
p <- p + theme_bw()
p <- p + guides(color = guide_colorbar(ticks = FALSE))
p <- p + labs(color="log count")
p <- p + theme(strip.text = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())A statistical method can automatically identify spatially variable genes without relying on previous knowledge.
There are several existing methods, but it is still an area of active research.
METHOD OVERVIEW
require(scran)
interesting_genes <- rownames(res$res_mtest)[order(res$res_mtest$combinedPval)[1:10]]
spe <- logNormCounts(spe)
Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])
## Switch rows and columns and make dataframe
df <- as.data.frame(t(Y.sub))
## Add spatial coordinates
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]
### Construct plot
require(reshape2)
df <- melt(df, id.vars=c("x","y"))
p <- ggplot(data=df,aes(x=x,y=y,color=value))
p <- p + geom_point(size=0.05)
p <- p + coord_fixed()
p <- p + scale_color_gradientn(colors=c("gray90","blue","black"),
breaks=c(0,4,8))
p <- p + facet_wrap(~variable, nrow=2)
p <- p + theme_bw()
p <- p + guides(color = guide_colorbar(ticks = FALSE))
p <- p + labs(color="log count")
p <- p + theme(strip.text = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
pOur dataset has over \(30,000\) genes. However, many appear to be highly correlated (see the top SVG plot we made earlier).
If we account for the strong correlation, we may be able to summarize the data using a much smaller number of genes, thus reducing the “dimension”.
Informally, PCA begins by finding the line that best approximates the data (PC dimension 1). Next, it finds a line perpendicular to PC dimension 1 that best approximates the data (PC dimension 2). Then repeat to find PC dimension 3, 4, etc.
To reduce the data to dimension \(k\), we can project each data point onto the linear space spanned by the first \(k\) PC dimensions.
For example, when \(k = 1\) we project each point onto a line. When \(k = 2\), we project each point onto a plane.
For computational speed, it is standard practice to subset to the ~2000 most highly variable genes before running PCA:
What do you notice about the genes that contribute to PC dimension 1?
df <- data.frame(x=spatialCoords(spe)[,1],
y=spatialCoords(spe)[,2],
PC1=reducedDim(spe)[,1],
PC2=reducedDim(spe)[,2],
PC3=reducedDim(spe)[,3])
df <- df |> pivot_longer(cols=c(PC1,PC2,PC3))
p <- ggplot(data=df,aes(x=x,y=y,color=value)) +
geom_point(size=0.5) +
coord_fixed()+
scale_color_gradient2(low="blue",mid="grey", high="red") +
theme_bw()+
facet_wrap(~name,nrow=1)
pThe PCA results show that there are groups of spots that have very similar expression levels. By applying a clustering algorithm we can explicitly define these groups of cells. Many approaches for clustering first build the shared nearest neighbor (SNN) graph of the cells.
INSERT FIG
A marker gene is one that is differentially expressed in a particular cluster. Marker genes are used to better understand the biological function of the identified clusters.
for(i in 1:length(markers)) {
print(paste("Cluster", i))
print(rownames(markers[[i]])[order(markers[[i]]$mean.AUC, decreasing=TRUE)[1:5]])
}[1] "Cluster 1"
[1] "SNAP25" "SYT1" "NEFL" "VSNL1" "NRGN"
[1] "Cluster 2"
[1] "MBP" "MOBP" "PLP1" "BCAS1" "SHTN1"
[1] "Cluster 3"
[1] "MBP" "PLP1" "MOBP" "CNP" "MTURN"
[1] "Cluster 4"
[1] "SLC1A2" "MAP1A" "OLFM1" "ENC1" "NEFL"
[1] "Cluster 5"
[1] "ENSG00000269028" "ENSG00000255823" "CXCL14" "GLUL"
[5] "CST3"
[1] "Cluster 6"
[1] "MAP1A" "MAP1B" "SNAP25" "SYT1" "SLC17A7"
[1] "Cluster 7"
[1] "SPARC" "AQP4" "FOS" "SPARCL1" "CLU"
Per REF,
MOBP is a white matter/oligodendrocyte marker (Cluster 4)
SNAP25 is a gray matter/neuron marker (Cluster 2)
Ji, N., & Van Oudenaarden, A. (2012). Single molecule fluorescent in situ hybridization (smFISH) of C. elegans worms and embryos.